3.4 Create variables with aggregated values - aggregate

With the aggregate command, you can easily create new variables that contain aggregated calculations based on existing variables. The calculations can be grouped by the categories of a chosen variable.

Let's say you have a full population dataset at the person level that contains, among other things, a variable with the municipality of residence. You want to add contextual information about the number of residents in each municipality, so that all individuals residing in the same municipality receive the same value for the number of residents. You then set up the following expression:

aggregate (count) gender -> residents, by(municipality)

This counts the number of individuals with a value for gender within each of the municipalities given by the by() expression, which is equivalent to performing a full census (since all individuals have a value for gender). The result is a new variable, residents, which contains the number of residents in the individual's municipality of residence. For the count to be correct, it is important that your dataset covers the total population (all residents in Norway at the given time).

Below is an example where we create contextual variables showing the number of residents, the number of job seekers (registered with NAV), and the share of job seekers relative to the population in each municipality. By counting everyone with a value for the variable job_seeker, you are in effect counting the number of job seekers (only job seekers registered with NAV have a value for this variable; everyone else is missing and therefore excluded from the count):

 require no.ssb.fdb:38 as db

create-dataset jobseeker_analysis
import db/BEFOLKNING_KOMMNR_FAKTISK 2023-01-01 as municipality
import db/BEFOLKNING_KJOENN as gender
import db/ARBEIDSOKER_AS_HOVED 2023-01-31 as job_seeker

//Check how many individuals have a registered job-seeker status
tabulate job_seeker

//Count residents (everyone has a value for gender) and registered job seekers per municipality
aggregate (count) gender -> residents job_seeker -> job_seekers, by(municipality)

//Calculate the share of job seekers in each municipality
generate share_job_seekers = job_seekers / residents

//Create statistics on how many individuals live in municipalities of a given population size and a given number of job seekers
histogram residents, freq
histogram job_seekers, freq
 

Another scenario: you have a dataset with the variables gender and salary, and you want to add contextual information about the average salary for men and women respectively. You then set up this expression, which creates the variable average_salary_gender:

aggregate (mean) salary -> average_salary_gender, by(gender)

Below is an example where we create variables showing the average salary (yearly wage income) by gender, municipality of residence and county of residence respectively, and analyse the effect of the average salary in the municipality of residence on one's own earnings:

 require no.ssb.fdb:38 as db

create-dataset salary_analysis
import db/INNTEKT_WLONN 2022-12-31 as salary
import db/BEFOLKNING_KJOENN as gender
import db/BEFOLKNING_KOMMNR_FAKTISK 2023-01-01 as municipality

//The first two digits of the Norwegian municipality code identify the county
generate county = substr(municipality,1,2)

//First create salary statistics divided by gender and county
summarize salary
tabulate gender, summarize(salary) mean freq
tabulate county, summarize(salary) mean freq

//Create new variables that contain average salary for respectively gender, municipality and county
aggregate (mean) salary -> average_gender_salary, by(gender)
aggregate (mean) salary -> average_municipality_salary, by(municipality)
aggregate (mean) salary -> average_county_salary, by(county)

//Analyze the correlation between individual salary and the average salary in the municipality of residence. The linear regression controls for gender and county of residence
correlate salary average_municipality_salary
regress salary i.gender i.county average_municipality_salary
 

The measure types available for aggregate() are the same as for collapse():

  • count (number of units with valid values)
  • mean
  • median
  • min
  • max
  • sum
  • sd (standard deviation)
  • semean (standard error of the mean)
  • sebinomial (binomial standard error of the mean)
  • percent (percentage of valid values)
  • iqr (interquartile range)

Note that aggregate() lets you create several aggregated variables in a single command, and with different types of calculations.

Examples:

 aggregate (mean) salary -> average_salary (min) salary -> minimum_salary (max) salary -> maximum_salary, by(gender)

aggregate (mean) salary -> average_salary wealth -> average_wealth electricity_consumption -> average_consumption (max) electricity_consumption -> maximum_consumption, by(municipality)
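
The remaining measure types follow the same pattern. For instance, a small sketch (reusing the salary and municipality variables from the example above) that adds the median, standard deviation and interquartile range of salary per municipality:

aggregate (median) salary -> median_salary (sd) salary -> sd_salary (iqr) salary -> iqr_salary, by(municipality)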
 

Aggregate vs collapse

In short, the aggregate() command makes it much easier to create aggregated values than using the collapse() command (you can read more about this command here):

  • aggregate(): New variables containing the aggregated calculations are added to the existing dataset, which keeps its original unit level

  • collapse(): The entire dataset is aggregated to the level of the grouping variable. If this is a municipality variable and your dataset has person as the unit level, the dataset is aggregated from person level to municipality level. Furthermore, the variable you aggregate (e.g. salary) is replaced with its aggregated value at the new level (municipality).

  • Both commands generate aggregate statistics grouped by a categorical variable, but where collapse() aggregates the entire dataset up to the level at which the values are grouped (e.g. municipalities), aggregate() only creates a new variable containing the aggregated calculations, as the sketch below illustrates. Furthermore, units with the same value for the grouping variable (e.g. all with the same municipality of residence) get the same aggregated value (e.g. the average for that municipality).
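
To make the difference concrete, here is a minimal sketch reusing the salary and municipality variables from the example above (the two commands are alternatives, not steps to run in sequence):

//collapse: the dataset itself is reduced to one row per municipality,
//and salary is replaced by the municipality mean
collapse (mean) salary, by(municipality)

//aggregate: the dataset stays at person level, and a new variable holding
//the municipality mean for each person is added
aggregate (mean) salary -> average_municipality_salary, by(municipality)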

You will get the same result with collapse() as with aggregate() if you take the aggregated dataset and link it to the raw dataset with the grouping variable (municipality) as the linking key. You must then create two identical datasets, where one is used for the aggregation while the other (the raw dataset) serves as the base/receiver for linking in the aggregated values (via the grouping variable).
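
A hedged sketch of that workflow, reusing the variables from the salary example above (exactly how merge links the two unit levels via the grouping variable is an assumption here; in practice aggregate saves you all of these steps):

require no.ssb.fdb:38 as db

//raw person-level dataset used as the receiver
create-dataset persondata
import db/INNTEKT_WLONN 2022-12-31 as salary
import db/BEFOLKNING_KOMMNR_FAKTISK 2023-01-01 as municipality

//identical dataset used only for the aggregation
create-dataset muni_averages
import db/INNTEKT_WLONN 2022-12-31 as salary
import db/BEFOLKNING_KOMMNR_FAKTISK 2023-01-01 as municipality

//aggregate to municipality level and give the mean a distinct name
collapse (mean) salary, by(municipality)
rename salary average_municipality_salary

//link the municipality-level mean back onto the person-level dataset,
//with municipality as the linking key (assumed merge behaviour)
merge average_municipality_salary into persondata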